I  Introduction: What is a reasonable model?

A number of relatively diverse problems are often referred to under
the topic of "parallel computation". The viewpoint of this paper is
that of a "tightly coupled", synchronized (by a global clock) collection
of parallel processors, working together to solve a terminating
computational problem. Such parallel processors already exist and are
used to solve time consuming problems in a wide variety of areas
including computational physics, weather forecasting, etc. The current
state of hardware capabilities will extend the use of such parallel
processors to many more applications as the speed and the number of
processors that can be tightly coupled increase dramatically.

Within this viewpoint, Preparata and Vuillemin [7] distinguish two
broad categories. Namely, we can differentiate between those models
that are based on a fixed connection network of processors and those
that are based on the existence of global or shared memory. In the
former case, we assume that only graph theoretically adjacent processors
can communicate in a given step, and we usually assume that the network
is reasonably sparse; as examples, consider the shuffle-exchange network
(Stone [10]) and its development into the Ultracomputer of Schwartz [8],
the array or mesh connected processors such as the Illiac IV, the
cube-connected cycles of Preparata and Vuillemin [7], or the more basic
n-dimensional hypercube studied in Valiant and Brebner [12]. As examples
of models based on shared memories, there are the PRAC of Angluin and
Valiant [1], the PRAM of Fortune and Wyllie [2], the unnamed parallel
model of Shiloach and Vishkin [9], and the SIMDAG of Goldschlager [3].
Essentially these models differ in whether or not they allow fetch and
write conflicts, and if allowed, how write conflicts are resolved.

From a hardware point of view, fixed connection models seem more
reasonable and, indeed, the global memory-processor interconnection
would probably be realized in practice by a fixed connection network
(see Schwartz [8]). Furthermore, for a number of important problems
(e.g., FFT, bitonic merge, etc.) either the shuffle-exchange or the
cube-connected cycles provide optimal hosts for well known algorithms.
On the other hand, many problems require only infrequent and irregular
processor communication, and in any case the shared memory framework
seems to provide a more convenient environment for constructing
algorithms. Finally, in defense of the PRAM, it is plausible to assume
that some broadcast facilities could be made available.

The problem of sorting, and the related problem of routing, are
prototype problems, due both to their intrinsic significance and their
role in processor communication. Since merging is a (the) key subroutine
in many sorting strategies, we are interested in merging and sorting with
respect to both the fixed connection and shared memory models. In a
companion paper, we study the routing problem for fixed connection
networks, such as the n-dimensional cube. For such a machine, the
complexity of merging has been resolved by the fundamental log n
algorithms of Batcher (see Knuth [5] for a discussion of odd-even and
bitonic merge). The lower bound in this regard is immediate because
log n is the graph theoretic diameter. In this paper, we concentrate on
the complexity of merging (with application to sorting) on shared memory
machines.


II  A Hierarchy of Models


The shared memory models usually studied all possess a global
memory, each cell of which can be read or written by any processor. For
the purpose of constructing algorithms, one usually assumes a single
instruction stream; that is, one program is executed by all processors.
However, when the processor number itself is used to control the
sequencing of steps, and some ability to synchronize control is
introduced, then the effect is that of a multiple instruction stream.
The processors are assumed to have some local memory and each processor
can execute basic primitive operations such as ≤, =, < comparisons and
integer +, -, ×, ÷ arithmetic operations in a single step. The following
models have been considered:

1.  PRAC  (Angluin  and  Valiant)  -  Simultaneous  read  or  write  (of  the 
same  cell)  is  not  allowed. 

2.  PRAM  (Fortune  and  Wyllie)  -  Simultaneous  fetches  are  allowed  but  no 
simultaneous  writes. 

3.  WRAM - WRAM denotes a variety of models that allow simultaneous
reads and (certain) writes, but differ in how such write conflicts
are to be resolved.

a.  (Shiloach  and  Vishkin)  a  simultaneous  write  is  allowed  only  if 
all  processors  are  trying  to  write  the  same  thing,  otherwise  the 
computation  is  not  legal. 

b.  An arbitrary processor is allowed to write.

c.  (Goldschlager) the lowest numbered processor is allowed to
write.

Other variants are clearly possible. We are concerned with the
merging and sorting problems of elements from an arbitrary linear order
(i.e. the schematic or structured approach). In this context, a "most
powerful" parallel model (analogous to the comparison tree for
sequential computation) has been studied by Valiant. The parallel
computation tree idealizes k-processor parallelism by a $3^k$-ary tree
where each node is labelled by a set of k {<,=,>} comparisons and the
branches are labelled by each of the $3^k$ possible outcomes. It should
be clear that for the problems of concern, parallel computation trees
can simulate any reasonable parallel model, and in particular, can
simulate all of the aforementioned shared memory models.
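
The write rules of the models listed above can be illustrated by a small sequential simulation of one parallel write step. This is only a sketch: the function name `resolve_writes` and the model labels are hypothetical, and the "arbitrary" rule is simulated by taking the first request, which is one admissible choice among many.

```python
def resolve_writes(requests, model):
    """Simulate one parallel write step.

    requests: list of (processor_id, cell, value) triples.
    model: "PRAC", "PRAM", "WRAM-common", "WRAM-arbitrary" or "WRAM-priority"
           (labels chosen here for illustration only).
    Returns a dict mapping each written cell to its final value.
    """
    by_cell = {}
    for pid, cell, value in requests:
        by_cell.setdefault(cell, []).append((pid, value))

    memory = {}
    for cell, writers in by_cell.items():
        if len(writers) == 1:
            memory[cell] = writers[0][1]
        elif model in ("PRAC", "PRAM"):
            # Neither model permits simultaneous writes to one cell.
            raise ValueError(f"write conflict on cell {cell}")
        elif model == "WRAM-common":
            # Legal only if every processor writes the same value.
            values = {value for _, value in writers}
            if len(values) != 1:
                raise ValueError(f"illegal computation on cell {cell}")
            memory[cell] = values.pop()
        elif model == "WRAM-arbitrary":
            # Any writer may win; this sketch picks the first request.
            memory[cell] = writers[0][1]
        elif model == "WRAM-priority":
            # The lowest numbered processor wins.
            memory[cell] = min(writers, key=lambda w: w[0])[1]
        else:
            raise ValueError(f"unknown model {model}")
    return memory
```

The simulation covers writes only; the PRAC additionally forbids simultaneous reads, which would need a separate check.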

Let M denote any of these models. We will be concerned with
$T^{M}_{\mathrm{merge}}(n,m,p)$ and $T^{M}_{\mathrm{sort}}(n,p)$, the
minimum number of parallel steps to merge two sorted lists of n and m
elements (respectively, to sort n arbitrary elements) using p
processors. Typically, n=m, and p=O(n) or $O(n \log^{\alpha} n)$.
Clearly, for any problem we have

$T^{PRAC} \ge T^{PRAM} \ge T^{WRAM}$.

Our  main  contribution  is  to  establish  the  following  two  theorems: 

Theorem 1: Let M denote the parallel computation tree model. Then
$T^{M}_{\mathrm{merge}}(n,n,n^{\alpha}) = \Omega(\log\log n)$ for $\alpha < 2$.

Theorem 2: $T^{PRAM}_{\mathrm{merge}}(n,n,n) = O(\log\log n)$.

We use Valiant's algorithm, which already establishes the bound for
the parallel comparison tree, but following Valiant [11], Preparata [6]
and Shiloach and Vishkin [9], remark that a "processor allocation"
problem must be solved to realize Valiant's algorithm on the PRAM model.
Hence, the problem of merging is now resolved on all of the above shared
memory models except the PRAC, for which we cannot improve on the log n
upper bound of the Batcher merge. For the PRAC, it is not difficult to
show that $\Omega(\sqrt{\log n})$ is a lower bound for insertion (and
hence merging); indeed we conjecture that insertion requires
$\Omega(\log n)$ on a PRAC.

With  regard  to  sorting,  we  have  the  following  direct  corollaries: 

Corollary 1: $T^{PRAM}_{\mathrm{sort}}(n,n) = O(\log n \log\log n)$.

Corollary 2: $T^{PRAM}_{\mathrm{sort}}(n, n \log n) = O(\log n)$.

Clearly, Corollary 1 follows from a standard merge sort, whereas
Corollary 2 is a restatement of Preparata's [6] result, which can now be
stated for PRAM's using Theorem 2. Corollaries 1 and 2 should be
compared with the Shiloach and Vishkin upper bound of
$O(\log^2 n / \log(p/n) + \log n)$ for sorting on their version of a
WRAM with p processors. With regard to lower bounds for sorting,
Häggkvist and Hell [4] prove that in terms of the parallel computation
tree, time less than or equal to k implies $\Omega(n^{1+1/k})$
processors are required (and this is essentially sufficient). It
follows that for the tree model and any of the RAM models,
$\Omega(\log n / \log\log n)$ is a lower bound for sorting with
$O(n \log^{\alpha} n)$ processors. For O(n) processors, $\Omega(\log n)$
is a trivial lower bound resulting from the sequential lower bound of
$\Omega(n \log n)$. Among the open questions for parallel sorting are
the following: the number of processors for O(log n) time sorting on a
PRAC (Preparata [6] achieves O(k log n) time with $n^{1+1/k}$
processors); whether it is possible to sort in time o(log n), and in
particular in time O(1), on a PRAC or PRAM; whether it is possible to
sort in time O(1) on a WRAM using only a number of processors
polynomial in n.
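
The step from the Häggkvist and Hell processor bound to the time lower bound can be spelled out as follows. This is a sketch that drops constant factors and assumes the processor count is $p = n \log^{\alpha} n$ for a fixed $\alpha > 0$.

```latex
% Sorting in time k requires on the order of n^{1+1/k} processors,
% so with p = n \log^{\alpha} n processors we need
\begin{align*}
n^{1+1/k} \le n \log^{\alpha} n
  &\iff n^{1/k} \le \log^{\alpha} n \\
  &\iff \frac{1}{k}\,\log n \le \alpha \log\log n \\
  &\iff k \ge \frac{\log n}{\alpha \log\log n},
\end{align*}
% that is, k = \Omega(\log n / \log\log n).
```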
III  An Ω(loglog n) lower bound for merging on Valiant's model

Sequentially merging two lists of length n can be accomplished with
2n-1 comparisons and this is provably optimal. Since only 2n-1
comparisons are necessary to merge two such lists, conceivably in a
parallel model they could be merged in time O(1) with n processors.
However, we shall show that this is not possible. Even allowing
$n^{\alpha}$ comparisons per step for $\alpha < 1$, a depth of loglog n
is needed.


Consider the process of merging two sorted lists $a_1, \ldots, a_n$ and
$b_1, \ldots, b_n$ with n processors. At the first step at most n
comparisons can be made. Partition each list into $2\sqrt{n}$ blocks of
length $\sqrt{n}/2$. Form pairs of blocks, one from each list. There
are 4n such pairs of blocks. Clearly there must be 3n pairs $(A_i,B_j)$
of blocks such that no element from the block $A_i$ is compared with any
element from the block $B_j$. We shall show that we can select pairs of
blocks

$(A_{i_1},B_{j_1}), (A_{i_2},B_{j_2}), \ldots, (A_{i_{\frac{1}{2}\sqrt{n}}},B_{j_{\frac{1}{2}\sqrt{n}}})$

such that $i_l < i_{l+1}$ and $j_l < j_{l+1}$ for $1 \le l < \frac{1}{2}\sqrt{n}$.
If the total order is such that all elements in $A_{i_l} \cup B_{j_l}$
are less than any element in $A_{i_{l+1}} \cup B_{j_{l+1}}$,
$1 \le l < \frac{1}{2}\sqrt{n}$, then after the first stage we are faced
with $\frac{1}{2}\sqrt{n}$ subproblems each of size $\sqrt{n}/2$.


At the second stage the n processors are partitioned somehow among
the $\frac{1}{2}\sqrt{n}$ subproblems. However this is done, at least
one half of the subproblems have assigned to them fewer than twice the
average available number of processors per subproblem. Thus there are
$\frac{1}{4}\sqrt{n}$ subproblems with at most $4\sqrt{n}$ processors
per problem. Intuitively this argument suggests that at each stage the
size of subproblem goes down by a square root and hence loglog n time is
necessary. These ideas will be made precise in the following lemmas.
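
The intuition that repeated square-rooting yields a count on the order of loglog n can be checked numerically. The sketch below is an illustration only: the function name is hypothetical, and it simply iterates the map s -> sqrt(s)/2 suggested by the informal argument until the subproblem size drops to 1 or below.

```python
import math

def sqrt_shrink_steps(n):
    """Count stages until a subproblem of size n shrinks to size <= 1
    when every stage replaces the size s by sqrt(s) / 2."""
    s, steps = float(n), 0
    while s > 1:
        s = math.sqrt(s) / 2
        steps += 1
    return steps
```

For example, `sqrt_shrink_steps(2**16)` gives 4 and `sqrt_shrink_steps(2**64)` gives 6, which grows like log log n rather than log n.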

In what follows let $G = (A \cup B, E)$ be a bipartite graph with
$E \subseteq A \times B$. Further let $A_1, A_2, \ldots$ and
$B_1, B_2, \ldots$ be fixed orderings of the vertices in A and B,
respectively. A matching is said to be compatible if for each pair of
edges $(A_i,B_j)$ and $(A_h,B_g)$ in the matching, $i<h$ if and only if
$j<g$.

Lemma 1: Let $G=(A \cup B,E)$ be a bipartite graph with
$A = A_1, A_2, \ldots, A_{2k}$ and $B = B_1, B_2, \ldots, B_{2k}$ and let
$E \subseteq A \times B$ have $3k^2$ edges. Then G has a compatible
matching of cardinality at least k/2.

Proof: Partition the edges into 2k-1 diagonal blocks as follows. For
$-k<b<k$ we have a block consisting of edges
$\{(A_i,B_{i+b}) \mid 1 \le i \le 2k \text{ and } 1 \le i+b \le 2k\}$.
In addition we have one block consisting of all other edges. At most
$2k^2$ edges fall into the block of other edges. Thus at least $k^2$
edges must be partitioned into 2k-1 blocks. Hence at least one block
must have at least $k^2/(2k-1) > k/2$ edges and these edges form a
compatible matching.
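
The proof above is constructive, and a minimal sketch of it follows. The function names are hypothetical, and edges are represented as index pairs (i, j) standing for $(A_i, B_j)$ with indices from 1 to 2k, which is an assumption of this illustration.

```python
from collections import defaultdict

def compatible_matching(edges, k):
    """edges: iterable of (i, j) with 1 <= i, j <= 2k, meaning (A_i, B_j).
    Returns the largest diagonal block {(i, i + b)} with -k < b < k,
    sorted by i. Any such block is a compatible matching."""
    diagonals = defaultdict(list)
    for i, j in edges:
        b = j - i
        if -k < b < k:
            diagonals[b].append((i, j))
    if not diagonals:
        return []
    return sorted(max(diagonals.values(), key=len))

def is_compatible(matching):
    """Check that no vertex is reused and that the two index orders agree."""
    return all(i != h and j != g and (i < h) == (j < g)
               for a, (i, j) in enumerate(matching)
               for (h, g) in matching[a + 1:])
```

With $3k^2$ edges, at least $k^2$ of them lie on the 2k-1 diagonals, so the returned matching has more than k/2 edges.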

Lemma 2: Let T(s,c) be the time necessary to solve k, $k \ge 1$,
problems of size s with cks processors. Then T(s,c) is
$\Omega(\log \frac{\log s}{\log c})$.

Proof: On the average we can assign cs processors to each problem. At
least one half of the problems can have no more than twice this number
of processors assigned to them. That is, at least k/2 problems have at
most 2cs processors.

Consider applying 2cs processors to a problem of size s. This
means that in the first step we can make at most 2cs comparisons.
Partition the lists into $2\sqrt{2cs}$ blocks each of size
$\frac{1}{2}\sqrt{s/2c}$. There are 8cs pairs of blocks. Thus there
must be 6cs pairs of blocks with no comparisons between elements of the
blocks in a pair. Construct a bipartite graph whose vertices are the
blocks from the two lists with an edge between two blocks if there are
no comparisons between elements of the two blocks. Clearly there are
6cs edges and thus by the previous lemma there is a compatible matching
of size at least $\frac{1}{2}\sqrt{2cs}$. This means that there are at
least $\frac{1}{2}\sqrt{2cs}$ problems each of size at least
$\frac{1}{2}\sqrt{s/2c}$ that we must still solve. Thus
$T(s,c) \ge 1 + T(\frac{1}{2}\sqrt{s/2c},\, 4c)$.


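We show by induction on s that

$T(s,c) \ge d \log \frac{\log s}{\log c}$

for some sufficiently small d.

$$\begin{aligned}
T(s,c) &\ge 1 + d \log \frac{\log\left(\frac{1}{2}\sqrt{s/2c}\right)}{\log 4c} \\
       &\ge 1 + d \log \frac{\log s}{4 \log c} \\
       &= 1 + d \log \frac{\log s}{\log c} - d \log 4 \\
       &\ge d \log \frac{\log s}{\log c}
\end{aligned}$$

provided $d < \frac{1}{2}$. Observe that $\log \frac{\log s}{\log c}$ is
$\Omega(\log\log s - \log\log c)$ which matches Valiant's upper bound of
$2(\log\log s - \log\log c)$.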
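
The recurrence can also be unrolled numerically. The sketch below is an illustration under stated assumptions: it takes the recurrence in the form T(s,c) >= 1 + T(sqrt(s/2c)/2, 4c), charges one step per level while the subproblem size exceeds 1, and uses a hypothetical function name.

```python
import math

def recurrence_depth(s, c):
    """Unroll T(s, c) >= 1 + T(sqrt(s / (2c)) / 2, 4c), counting one step
    for every level at which the subproblem size is still above 1."""
    depth = 0
    while s > 1:
        s = 0.5 * math.sqrt(s / (2 * c))
        c = 4 * c
        depth += 1
    return depth
```

For s = 2^64 and c = 2 the unrolling gives depth 4, while d log(log s / log c) with d = 1/2 and base-2 logarithms equals 3, so this instance is consistent with the stated bound.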


IV  An O(loglog n) upper bound for merging on a PRAM

We recall Valiant's (n,m) merging algorithm which merges X and Y
with #X=n, #Y=m, $n \le m$ using $\sqrt{nm}$ processors. Our goal is to
implement Valiant's algorithm on a PRAM. The algorithm (taken verbatim
from Valiant [11]) proceeds as follows:
(a)  Mark the elements of X that are subscripted by $i \cdot \lceil\sqrt{n}\rceil$
and those of Y subscripted by $i \cdot \lceil\sqrt{m}\rceil$ for
i = 1,2, .... There are at most $\lfloor\sqrt{n}\rfloor$ and
$\lfloor\sqrt{m}\rfloor$ of these, respectively. The sublists between
successive marked elements and after the last marked element in
each list we call segments.

(b)  Compare each marked element of X with each marked element of Y.
This requires no more than $\lfloor\sqrt{nm}\rfloor$ comparisons and can
be done in unit time.

(c)  The comparisons of (b) will decide for each marked element the
segment of the other list into which it needs to be inserted. Now
compare each marked element of X with every element of the segment
of Y that has thus been found for it. This requires at most

$\lfloor\sqrt{n}\rfloor \cdot (\lceil\sqrt{m}\rceil - 1) \le \lfloor\sqrt{nm}\rfloor$

comparisons altogether and can also be done in unit time.

On the completion of (a), (b) and (c) we can store each
$X_{i\lceil\sqrt{n}\rceil}$ in its appropriate place in the output Z.
It then remains to merge the disjoint pairs of sublists
$(X_0,Y_0), (X_1,Y_1), \ldots$ where the $X_i$ and $Y_i$ are segments of
X and Y respectively. Whereas Cauchy's inequality guarantees that there
will be enough processors to carry out these independent merges by
simultaneous recursive calls of the algorithm, it is not clear how to
inform each processor to which $(X_i,Y_i)$ subproblem (and in what
capacity) it will be assigned. This is the main concern in what
Shiloach and Vishkin [9] refer to as the processor allocation problem.
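
Steps (a), (b) and (c) can be simulated sequentially to see that they locate each marked element of X exactly and stay within the stated comparison budgets. This is a sketch only: the function names are hypothetical, indices are 0-based, both lists are assumed non-empty and sorted, and the loops stand in for comparisons that the parallel algorithm performs in unit time.

```python
import math

def ceil_sqrt(v):
    """Smallest integer r with r * r >= v, for v >= 1."""
    return math.isqrt(v - 1) + 1

def locate_marked(X, Y):
    """Return (ranks, cmp_b, cmp_c). ranks maps the 0-based index of each
    marked element of X to the number of elements of Y smaller than it;
    cmp_b and cmp_c count the comparisons made in steps (b) and (c)."""
    n, m = len(X), len(Y)
    cn, cm = ceil_sqrt(n), ceil_sqrt(m)
    # (a) marked elements sit at 1-based positions cn, 2cn, ... and cm, 2cm, ...
    marked_x = list(range(cn - 1, n, cn))
    marked_y = list(range(cm - 1, m, cm))
    ranks, cmp_b, cmp_c = {}, 0, 0
    for ix in marked_x:
        # (b) compare with every marked element of Y to find the segment
        t = sum(1 for iy in marked_y if Y[iy] < X[ix])
        cmp_b += len(marked_y)
        # (c) compare with every element of that segment of Y
        segment = range(t * cm, min((t + 1) * cm - 1, m))
        below = sum(1 for iy in segment if Y[iy] < X[ix])
        cmp_c += len(segment)
        ranks[ix] = t * cm + below
    return ranks, cmp_b, cmp_c
```

Each rank agrees with what a binary search of Y would return, and both comparison counts stay at or below the integer part of sqrt(nm).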
We desire a recursive procedure (in fact, a macro might be more
appropriate) MERGE($i,n_i,j,m_i,k$) which merges
$X_i, X_{i+1}, \ldots, X_{i+n_i-1}$ and $Y_j, \ldots, Y_{j+m_i-1}$ into
$Z_{i+j-1}, \ldots, Z_{i+j+n_i+m_i-2}$ using at most $\sqrt{n_i m_i}$
processors beginning at processor number $p_k$. Such a merge will be
simultaneously invoked by processors
$p_k, p_{k+1}, \ldots, p_{k+\sqrt{n_i m_i}-1}$. The initial call is
MERGE(1,n,1,m,1). As we enter this subroutine, a processor $p_l$ will
know from $i, j, n_i, m_i$, and $k$, the (relative) role it plays in
parts (a), (b) and (c) of Valiant's algorithm. For example, say
$n_i \le m_i$ and let

$l = k + i' \cdot \lceil\sqrt{n_i}\rceil + j'$,  $0 \le i' \le \lfloor\sqrt{n_i}\rfloor - 1$,  $0 \le j' \le \lfloor\sqrt{m_i}\rfloor - 1$;

then in step (b), processor $p_l$ compares
$X_{i'\lceil\sqrt{n_i}\rceil+i}$ and $Y_{j'\lceil\sqrt{m_i}\rceil+j}$.
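
A processor can recover its pair (i', j') from its own number by integer division. The sketch below is an illustration, not the paper's exact indexing: the function name is hypothetical, and it assumes a stride of floor(sqrt(m_i)) so that distinct processors receive distinct pairs.

```python
import math

def role(l, k, n_i, m_i):
    """Map processor number l to a pair (i', j') with
    0 <= i' < floor(sqrt(n_i)) and 0 <= j' < floor(sqrt(m_i)).
    Assumption of this sketch: processors are laid out row by row with
    stride floor(sqrt(m_i)), starting at processor number k; m_i >= 1."""
    rows, cols = math.isqrt(n_i), math.isqrt(m_i)
    i_prime, j_prime = divmod(l - k, cols)
    if not 0 <= i_prime < rows:
        return None  # this processor has no comparison to make in this step
    return i_prime, j_prime
```

Each processor computes its role locally in a constant number of arithmetic operations, with no communication.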

We will now indicate how processors reassign themselves before
recursively invoking the merge routine. For simplicity, assume that we
have just completed steps (a),(b),(c) of MERGE(1,n,1,m,1). We can
assume that we have determined for each i,
$0 \le i \le \lfloor\sqrt{n}\rfloor-1$, that
$Y_{j_i\lceil\sqrt{m}\rceil} < X_i \le Y_{(j_i+1)\lceil\sqrt{m}\rceil}$
and that we have constructed a table

$J = (j_0, \ldots, j_{\lceil\sqrt{n}\rceil-1}, m)$

accessible by all $\sqrt{nm}$ processors. A given processor p must
determine its role in the next iteration of the algorithm.


Lemma 3: Suppose $(X_0,Y_0), \ldots, (X_{r-1},Y_{r-1})$ have been
assigned processors, and $X_{r-1} = (\ldots, X_{r\lceil\sqrt{n}\rceil-1})$
and $Y_{r-1} = (\ldots, Y_f)$. There exists a function $\phi$ such that
no more than $\phi(r,n,f)$ processors have been assigned. Indeed,
$\phi(r,n,f) = r \cdot \sqrt{f/r} \cdot n^{1/4}$. (The bound is achieved
by considering the case $\#Y_i = f/r$ for all $i < r$.)

Proof: Cauchy's inequality will do the job here.
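
The appeal to Cauchy's inequality can be checked numerically. The sketch below rests on two assumptions that should be read as such: each subproblem with about sqrt(n) elements of X and y elements of Y is charged sqrt(sqrt(n) * y) processors, and the bound function has the form r * sqrt(f/r) * n^(1/4). The names `phi` and `processors_used` are hypothetical.

```python
import math

def phi(r, n, f):
    """Assumed form of the bound: r * sqrt(f / r) * n ** 0.25."""
    if r == 0:
        return 0.0
    return r * math.sqrt(f / r) * n ** 0.25

def processors_used(n, y_sizes):
    """Total processors when a subproblem with sqrt(n) elements of X and
    y elements of Y is charged sqrt(sqrt(n) * y) processors."""
    return sum(math.sqrt(math.sqrt(n) * y) for y in y_sizes)
```

By Cauchy's inequality the sum of sqrt(y_i) over r segments of total length f is at most sqrt(r f), with equality when all segments have length f/r, which is what the check below exercises.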


The impact of this Lemma is that we can safely assign processors
$p_{\phi(r,n,f)+1}, \ldots, p_{\phi(r+1,n,f)}$ to $(X_r,Y_r)$. It
remains for each processor to know to which $(X_k,Y_k)$ it will be
assigned. Indeed, once a processor knows to which $(X_k,Y_k)$ it has
been assigned, then it can obtain all the information it will need to
invoke MERGE from the table J and the $\phi$ function; namely, $X_k$
starts at $k\lceil\sqrt{n}\rceil$ and has length
$\lfloor\sqrt{n}\rfloor-1$, $Y_k$ starts at $Y_{j_k}$ and has length
$j_{k+1}-j_k+1$, and the processors assigned to this task begin at
$p_{\phi(k,n,j_k)+1}$.


The actual assignment of a processor to a $(X_k,Y_k)$ subproblem
proceeds in two stages (note that we cannot simply do a sequential
binary search in J because this would require $\log\sqrt{m}$ steps):

Stage 1)  Processors are assigned for those $(X_k,Y_k)$ with
$\#Y_k \le \sqrt{n}$ (and hence no more than $\sqrt{n}$ processors need
be assigned to this task since $\#X_k = \sqrt{n}-1$ for all k).

Stage 2)  Processors are assigned to the remaining $(X_k,Y_k)$.

Stage 1: For each k, $0 \le k \le \sqrt{m}-1$, we assign $\sqrt{n}$
processors to look at both the k and the k+1st entry of the table J. If
$j_{k+1}-j_k \le \sqrt{n}$, then these $\sqrt{n}$ processors inform (by
posting the information in an appropriate place of global memory)
processors numbered $\phi(k,n,j_k)+1, \ldots, \phi(k+1,n,j_{k+1})$ that
they are assigned to $(X_k,Y_k)$. We then wait until the completion of
Stage 2 before invoking merge on $(X_k,Y_k)$ since all $\sqrt{nm}$
processors are needed for Stage 2.

Stage 2: The processors are divided into $\sqrt{m}$ blocks, each block
containing $\sqrt{n}$ processors. Each of the $\sqrt{n}$ processors in
a block is trying to determine to which $(X_k,Y_k)$ these $\sqrt{n}$
processors will be assigned. Let $p_{j_l}$ be the first processor of
block l. The kth processor of block l looks at the entries $j_k$ and
$j_{k+1}$ in table J and determines (via the function $\phi$) whether or
not processor $p_{j_l}$ would be assigned to this subproblem. Now each
processor p in the lth block can determine (again via table J and
$\phi$) which of the following hold:

i)  p is assigned to $(X_k,Y_k)$, the subproblem assigned to $p_{j_l}$

ii)  p is assigned to $(X_{k'},Y_{k'})$, the subproblem assigned to $p_{j_{l+1}}$

iii)  p has already been assigned in Stage 1.

We claim that if neither i) nor ii) holds, then iii) must hold since
clearly less than $\sqrt{n}$ processors have been assigned to the same
task as p.
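
The end result of the allocation, a table telling every processor which subproblem it serves, can be sketched sequentially. This illustration does not reproduce the two-stage parallel procedure: it uses exact prefix sums of floor(sqrt(#X_k * #Y_k)) in place of the $\phi$ function, and the names `allocate`, `start` and `owner` are hypothetical.

```python
import math

def allocate(x_sizes, y_sizes):
    """Give subproblem k the processors start[k], ..., start[k+1] - 1 and
    build owner[p] = k, the entry processor p would read from global memory.
    Assumption of this sketch: subproblem k needs floor(sqrt(x_k * y_k))
    processors, where x_k and y_k are the segment lengths."""
    need = [math.isqrt(x * y) for x, y in zip(x_sizes, y_sizes)]
    start = [0]
    for count in need:
        start.append(start[-1] + count)
    owner = [None] * start[-1]
    # Conceptually done in parallel, one group of posters per subproblem.
    for k in range(len(need)):
        for p in range(start[k], start[k + 1]):
            owner[p] = k
    return start, owner
```

By Cauchy's inequality the total number of processors handed out never exceeds floor(sqrt(nm)), where n and m are the sums of the segment lengths.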

V  Application to sorting and open problems

Preparata [6] derives a set of parallel sorting algorithms, all
based on what Knuth [5] calls enumeration sorting. The "count
acquisition stage" is often accomplished by merging. Using Batcher's
merge, sorting can then be performed in O(k log n) steps on a PRAC using
$n^{1+1/k}$ processors. More to the thrust of this paper, Preparata
shows how Valiant's merge can be used to derive an O(log n) time sort
using n log n processors. Since we have shown how to implement
Valiant's merge on a PRAM, Preparata's bounds will now be applicable to
the PRAM model. It is also clear that with only n processors, a merge
sort will take O(log n loglog n) time.
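
The O(log n loglog n) figure for n processors comes from the round structure of a bottom-up merge sort: about log n rounds, each of which is a batch of independent merges costing O(loglog n) parallel steps. The sketch below counts the rounds; it is a sequential illustration with a hypothetical function name, and it uses Python's `heapq.merge` in place of the parallel merge.

```python
import heapq

def merge_sort_rounds(items):
    """Bottom-up merge sort. Returns (sorted_list, rounds), where each round
    merges adjacent runs pairwise; on a PRAM all merges of one round would
    run simultaneously."""
    runs = [[x] for x in items]
    rounds = 0
    while len(runs) > 1:
        merged = []
        for a in range(0, len(runs), 2):
            pair = runs[a:a + 2]  # one or two adjacent runs
            merged.append(list(heapq.merge(*pair)))
        runs = merged
        rounds += 1
    return (runs[0] if runs else []), rounds
```

Sixteen items need 4 rounds and five items need 3, that is, the ceiling of log2 n rounds.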

A  number  of  open  problems  concerning  time  vs  number  of  processors 
are  readily  suggested  by  the  above  comments.  We  can  classify  two  sets 
of  questions: 

1)  The number of processors required for an O(log n) time sort on the
various models, the present upper bounds being $n^{1+1/k}$ (PRAC),
n log n (PRAM, WRAM, parallel computation tree). In all cases, n
processors is an obvious lower bound.

2)  For what models is it possible to sort in o(log n) time and, if
possible, how many processors are required? In particular, in O(1)
time, sorting can be done using $O(2^n)$ processors on a WRAM or in
constant time k using $n^{1+1/k}$ processors on a parallel computation
tree. We do not know if such a fast sort is possible for the PRAC
or PRAM.